May 29, 2015

Correction

Teaching data science to undergrads: not an ex-Googler's but a Reed professor's tales from the trenches.

HUM 110

All Reed freshmen are required to take Humanities 110.

Data Science

Class Description

Class Structure

Prereqs

Only intro stats and some exposure to the statistical software language R. No programming experience necessary.

Syllabus

  • 5 biweekly mini-reports submitted in R Markdown: reproducible research
  • Term project: both a written report and a 20-minute oral presentation
  • In-class participation

Classroom: ETC 208

Demographics

18 students, mostly juniors and seniors.

Major                                                        Count
Mathematics                                                  4
Biological Science: Biology & Biochem and Molecular Biology  4
Other Science: Chemistry, Environmental Studies, Physics     4
Social Science: Political Science, Sociology                 2
Economics                                                    2
Misc: Psychology, Linguistics                                2

In Practice

  • Real data, i.e. messy, in need of cleaning, and potentially drawn from disparate sources
  • Bottom-up: Let questions/data motivate the statistical methodology, rather than vice-versa
  • Discussions in class
  • Lean on R heavily

Tools

Environment: RStudio

How to get students to use R?

  • Key: Forget base R
  • How? The Hadleyverse of packages by Hadley Wickham.
  • In particular
    • dplyr package for data wrangling/manipulation
    • ggplot2 package for data visualization

dplyr Package

Features

  • Data manipulation is performed using verbs
  • The pipe operator %>%, pronounced "then", chains verbs so that each output feeds the next
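
To see what the pipe buys, here is a minimal sketch comparing a nested call with its piped equivalent. It uses R's built-in mtcars data rather than any course dataset, and the object names nested and piped are purely illustrative.

```r
library(dplyr)

# Nested form: the first operation (filter) is buried in the middle,
# so the code has to be read inside-out.
nested <- arrange(
  summarise(group_by(filter(mtcars, mpg > 20), cyl), avg_hp = mean(hp)),
  desc(avg_hp)
)

# Piped form: read top to bottom, saying "then" at each %>%.
piped <- mtcars %>%
  filter(mpg > 20) %>%             # keep fuel-efficient cars
  group_by(cyl) %>%                # then group by cylinder count
  summarise(avg_hp = mean(hp)) %>% # then average horsepower per group
  arrange(desc(avg_hp))            # then sort, highest first

# Both forms apply the same verbs in the same order.
all.equal(nested, piped)
```

Each line of the piped version is one verb, which is what makes it approachable for students with no programming background.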

Example: Houston Flights Dataset

Info on all domestic flights leaving Houston (IAH) in 2011:

  • flights: info on 227,496 flights
  • planes: info on 2853 airplanes

What are the top 5 carriers using the oldest planes (averaged over all flights)?

Flights

The flights dataset:

date dep arr carrier flight dest plane
2011-01-01 1400 1500 AA 428 DFW N576AA
2011-01-02 1401 1501 AA 428 DFW N557AA
2011-01-03 1352 1502 AA 428 DFW N541AA
2011-01-04 1403 1513 AA 428 DFW N403AA
2011-01-05 1405 1507 AA 428 DFW N492AA

Planes

The planes dataset:

plane   year  model           mfr                no.seats
N576AA  1991  DC-9-82(MD-82)  MCDONNELL DOUGLAS  172
N557AA  1993  KITFOX IV       MARZ BARRY         2
N403AA  1974  S55A            RAVEN              1
N492AA  1989  DC-9-82(MD-82)  MCDONNELL DOUGLAS  172
N262AA  1985  DC-9-82(MD-82)  MCDONNELL DOUGLAS  172

Example: Age of Planes

The following sequence of verbs wrangles the data:

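library(dplyr)

# Attach each flight's plane details, then rank carriers by mean plane age
left_join(flights, planes, by = "plane") %>%
  select(carrier, plane, year) %>%
  mutate(age = 2011 - year) %>%        # plane age in the year of the flights
  group_by(carrier) %>%
  summarise(avg_age = mean(age)) %>%   # one row per carrier, averaged over flights
  arrange(desc(avg_age)) %>%
  top_n(5, avg_age)                    # keep the 5 carriers with the highest avg_age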

Example: Age of Planes

carrier avg_age
MQ 29.421
AA 24.325
DL 20.760
US 19.078
UA 14.635
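
The same pipeline can be tried on a toy scale. The sketch below rebuilds small data frames from the sample rows shown earlier (the names flights_toy, planes_toy, and avg_age_toy are illustrative). Plane N541AA appears in the flights sample but not in the planes sample, so the left join leaves its year missing. Whether the full data contains such unmatched planes is not stated in the slides. Passing na.rm = TRUE is one possible way to handle them, not necessarily what was done in class.

```r
library(dplyr)

# Sample rows copied from the flights and planes tables shown earlier.
flights_toy <- data.frame(
  carrier = rep("AA", 5),
  plane   = c("N576AA", "N557AA", "N541AA", "N403AA", "N492AA"),
  stringsAsFactors = FALSE
)
planes_toy <- data.frame(
  plane = c("N576AA", "N557AA", "N403AA", "N492AA", "N262AA"),
  year  = c(1991, 1993, 1974, 1989, 1985),
  stringsAsFactors = FALSE
)

avg_age_toy <- left_join(flights_toy, planes_toy, by = "plane") %>%
  mutate(age = 2011 - year) %>%           # N541AA has no match, so its age is NA
  group_by(carrier) %>%
  summarise(
    avg_age   = mean(age, na.rm = TRUE),  # average over flights with a known year
    n_missing = sum(is.na(age))           # flights left out of that average
  )

avg_age_toy
```

Reporting n_missing alongside the average makes it visible how many flights the summary silently ignores.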

ggplot2: the Grammar of Graphics

  • A statistical graphic consists of a mapping of data variables to aesthetic attributes of geometric objects that we can observe.
  • ggplot2 allows us to construct graphics in a modular fashion by specifying the elements of the grammar.
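
As a minimal sketch of that grammar in code, the example below uses R's built-in mtcars data (not a course dataset) and spells out each element: the data, the aesthetic mappings, and the geometric object. The object names p and p_smooth are illustrative.

```r
library(ggplot2)

# Data: mtcars. Aesthetic mappings: wt -> x position, mpg -> y position,
# cyl -> color. Geometric object: points.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

# Modularity: adding a layer introduces a new geometric object (fitted lines)
# that reuses the same data and mappings.
p_smooth <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p)
```

Because the plot is built from separate pieces, students can change one element at a time (a mapping, a geom) and see exactly what it controls.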

Minard's map of Napoleon's Russian campaign of 1812:

Data (Variable)  Geometric Object  Aesthetic Attribute
longitude        points            x position
latitude         points            y position
army size        bars              width
army direction   bars              color
date             text              (x,y) position
temperature      lines             (x,y) position

Results

Delayed Flights

Age of Airplanes

Dataset: OkCupid Data

  • Sample of 10% of San Francisco OkCupid users in June 2012 (\(n=5995\))
  • 40.2% of the sample was female
  • Use logistic regression to predict gender
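
For readers who have not seen logistic regression in R, here is a minimal sketch using base R's glm(). The data are simulated and are NOT the OkCupid sample. The predictor (height), the column names, and the assumed relationship are all hypothetical, chosen only to make the example run.

```r
set.seed(1)

# Simulated stand-in for profile data; NOT the OkCupid sample.
n <- 500
height   <- rnorm(n, mean = 67, sd = 4)   # hypothetical predictor (inches)
p_female <- plogis(0.6 * (65 - height))   # assumed true relationship
profiles <- data.frame(
  height    = height,
  is_female = rbinom(n, size = 1, prob = p_female)
)

# Logistic regression: the log-odds of is_female are modeled as linear in height.
fit <- glm(is_female ~ height, data = profiles, family = binomial)

# Predicted probabilities, then a 0.5 cutoff to get a predicted class.
profiles$p_hat <- predict(fit, type = "response")
profiles$pred  <- as.integer(profiles$p_hat > 0.5)

# In-sample accuracy (optimistic; a held-out set would be a fairer check).
mean(profiles$pred == profiles$is_female)
coef(fit)
```

A useful baseline for the real data: with 40.2% of the sample female, always predicting the majority class is already right 59.8% of the time, so a model needs to beat that to be informative.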

Job

Self-Referenced Body Type

Dataset: Reed Jukebox

All 222,540 songs played on the Reed pool hall jukebox from 2003-2009 c/o Noah Pepper '09

date_time artist album track
Sun Dec 7 05:12:57 2003 Tom Petty and the Heartbreakers Into the Great Wide Open
Sun Dec 7 05:15:56 2003 Jefferson Airplane Somebody To Love
Sun Dec 7 05:23:04 2003 Led Zeppelin Led Zeppelin IV 08 When The Levee Breaks

Artist Popularity

Time Series

Maps

Interactive Shiny Apps

The Future

Statistics' Image Problem

  • You hear this a lot:
    • Statistician: Hi, I'm a statistician.
    • Non-statistician: Statistics? I hated that class!
  • You'll never hear this:
    • Statistician: Hi, my work involves a lot of data visualization.
    • Non-statistician: Data visualization? I hate that stuff!

Solution: Data Visualization

  • Data visualization is a backdoor way to get students interested in statistics.
  • Prez from Season 4 of "The Wire":

Impact on my Intro Stats Classes

This is the only stats class many will take.

  • Get them looking at, manipulating, and visualizing data QUICKLY
  • Better integration of lectures and labs: technology
  • Flipped-classroom model
    • Lab exercises at home
    • Problem solving/debugging and discussion in class

Conclusions

Take Home Messages

  • A class focused on the data first, methods second, using only open-source tools.
  • Interactivity boosts student interest
  • Rich Majerus wrote "Why should students at a small liberal arts college learn R?"
  • New tools like DataCamp are increasing the ratio: \[\frac{\mbox{Payoff from learning R}}{\mbox{Startup costs}}\]
  • Data visualization is a gateway drug for statistics
  • Developing skills and intuition takes time. Instructor attention and feedback are crucial.

Resources